{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "view-in-github"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/zihangdai/xlnet/blob/master/notebooks/colab_imdb_gpu.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "fnOHnctkG6kW"
   },
   "source": [
    "# XLNet IMDB movie review classification project\n",
    "\n",
    "This notebook is for classifying the [imdb sentiment dataset](https://ai.stanford.edu/~amaas/data/sentiment/).  It will be easy to edit this notebook in order to run all of the classification tasks referenced in the [XLNet paper](https://arxiv.org/abs/1906.08237). Whilst you cannot expect to obtain the state-of-the-art results in the paper on a GPU, this model will still score very highly. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "2mBzLdrdzodb"
   },
   "source": [
    "## Setup\n",
    "Install dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "hRHRPImGUth7"
   },
   "outputs": [],
   "source": [
    "! pip install sentencepiece"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "jy8gUsPuJNyw"
   },
   "source": [
    "Download the pretrained XLNet model and unzip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "HfPDGsUtHKG0"
   },
   "outputs": [],
   "source": [
    "# only needs to be done once\n",
    "! wget https://storage.googleapis.com/xlnet/released_models/cased_L-24_H-1024_A-16.zip\n",
    "! unzip cased_L-24_H-1024_A-16.zip "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "4uUwjq3BJRbu"
   },
   "source": [
    "Download extract the imdb dataset - surpessing output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "QOGRICbOIsU8"
   },
   "outputs": [],
   "source": [
    "! wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\n",
    "! tar zxf aclImdb_v1.tar.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "yGY_ggUUMrwU"
   },
   "source": [
    "Git clone XLNet repo for access to run_classifier and the rest of the xlnet module"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "-r190eYVMpiG"
   },
   "outputs": [],
   "source": [
    "! git clone https://github.com/zihangdai/xlnet.git"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "jDP-IaVuPC-z"
   },
   "source": [
    "## Define Variables\n",
    "Define all the dirs: data, xlnet scripts & pretrained model. \n",
    "If you would like to save models then you can authenticate a GCP account and use that for the OUTPUT_DIR & CHECKPOINT_DIR - you will need a large amount storage to fix these models. \n",
    "\n",
    "Alternatively it is easy to integrate a google drive account, checkout this guide for [I/O in colab](https://colab.research.google.com/notebooks/io.ipynb) but rememeber these will take up a large amount of storage. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "y7N_xVwavQlV"
   },
   "outputs": [],
   "source": [
    "SCRIPTS_DIR = 'xlnet' #@param {type:\"string\"}\n",
    "DATA_DIR = 'aclImdb' #@param {type:\"string\"}\n",
    "OUTPUT_DIR = 'proc_data/imdb' #@param {type:\"string\"}\n",
    "PRETRAINED_MODEL_DIR = 'xlnet_cased_L-24_H-1024_A-16' #@param {type:\"string\"}\n",
    "CHECKPOINT_DIR = 'exp/imdb' #@param {type:\"string\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "jR6euqwL1KBV"
   },
   "source": [
    "## Run Model\n",
    "This will set off the fine tuning of XLNet. There are a few things to note here:\n",
    "\n",
    "\n",
    "1.   This script will train and evaluate the model\n",
    "2.   This will store the results locally on colab and will be lost when you are disconnected from the runtime\n",
    "3.   This uses the large version of the model (base not released presently)\n",
    "4.   We are using a max seq length of 128 with a batch size of 8 please refer to the [README](https://github.com/zihangdai/xlnet#memory-issue-during-finetuning) for why this is.\n",
    "5. This will take approx 4hrs to run on GPU.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "CEMuT6LU0avg"
   },
   "outputs": [],
   "source": [
    "train_command = \"python xlnet/run_classifier.py \\\n",
    "  --do_train=True \\\n",
    "  --do_eval=True \\\n",
    "  --eval_all_ckpt=True \\\n",
    "  --task_name=imdb \\\n",
    "  --data_dir=\"+DATA_DIR+\" \\\n",
    "  --output_dir=\"+OUTPUT_DIR+\" \\\n",
    "  --model_dir=\"+CHECKPOINT_DIR+\" \\\n",
    "  --uncased=False \\\n",
    "  --spiece_model_file=\"+PRETRAINED_MODEL_DIR+\"/spiece.model \\\n",
    "  --model_config_path=\"+PRETRAINED_MODEL_DIR+\"/xlnet_config.json \\\n",
    "  --init_checkpoint=\"+PRETRAINED_MODEL_DIR+\"/xlnet_model.ckpt \\\n",
    "  --max_seq_length=128 \\\n",
    "  --train_batch_size=8 \\\n",
    "  --eval_batch_size=8 \\\n",
    "  --num_hosts=1 \\\n",
    "  --num_core_per_host=1 \\\n",
    "  --learning_rate=2e-5 \\\n",
    "  --train_steps=4000 \\\n",
    "  --warmup_steps=500 \\\n",
    "  --save_steps=500 \\\n",
    "  --iterations=500\"\n",
    "\n",
    "! {train_command}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "VvhqD-sO0Kyh"
   },
   "source": [
    "## Running & Results\n",
    "These are the results that I got from running this experiment\n",
    "### Params\n",
    "*    --max_seq_length=128 \\\n",
    "*    --train_batch_size= 8 \n",
    "\n",
    "### Times\n",
    "*   Training: 1hr 11mins\n",
    "*   Evaluation: 2.5hr\n",
    "\n",
    "### Results\n",
    "*  Most accurate model on final step\n",
    "*  Accuracy: 0.92416, eval_loss: 0.31708\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "XUW2avFM_fi_"
   },
   "source": [
    "### Model\n",
    "\n",
    "*   The trained model checkpoints can be found in 'exp/imdb'\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "collapsed_sections": [],
   "include_colab_link": true,
   "name": "XLNet-imdb-GPU.ipynb",
   "provenance": [],
   "toc_visible": true,
   "version": "0.3.2"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
